Biopython

Input and Output

In this chapter we'll be looking at the functionality that Biopython provides for reading and writing files. This is all handled by the Bio.SeqIO module so let's start by importing it:

In [1]:
from Bio import SeqIO

Files for this chapter

For this section you need to download some files which we will be reading in. You can either download them by hand from using these links: V01408.1.fasta, ls_orchid_short.gbk and ls_orchid.gbk or run the following Python code:

In [2]:
import urllib
for f in ["V01408.1.fasta", "ls_orchid_short.fasta", "ls_orchid.gbk"]:
    urllib.request.urlretrieve(f"https://milliams.com/courses/biopython/{f}", f)

SeqRecord objects

If you have a FASTA file with a single sequence in it, the simplest way to read it in is with SeqIO.read. This takes two arguments, the name of the file you want to open, and the format that file is in.

In [3]:
tm = SeqIO.read("V01408.1.fasta", "fasta")
print(tm)
ID: ENA|V01408|V01408.1
Name: ENA|V01408|V01408.1
Description: ENA|V01408|V01408.1 Tobacco mosaic virus genome (variant 1)
Number of features: 0
Seq('GTATTTTTACAACAATTACCAACAACAACAAACAACAAACAACATTACAATTAC...CCA')

The read functions returns an object called a SeqRecord. These are like the Seqs that we saw before but have additional information associated with them. A full list of these are:

.seq
The sequence itself, typically a Seq object as we saw in the last chapter.
.id
The primary ID used to identify the sequence.
.name
A "common" name/id for the sequence. In some cases this will be the same as the ID, but it could also be a clone name.
.description
A human readable description or expressive name for the sequence.
.letter_annotations
Holds per-letter-annotations using a (restricted) dictionary of additional information about the letters in the sequence. The keys are the name of the information, and the information is contained in the value as a Python sequence (i.e. a list, tuple or string) with the same length as the sequence itself. This is often used for quality scores or secondary structure information (e.g. from alignment files).
.annotations
A dictionary of additional information about the sequence. The keys are the name of the information, and the information is contained in the value. This allows the addition of more "unstructured" information to the sequence.
.features
A list of SeqFeature objects with more structured information about the features on a sequence (e.g. position of genes on a genome, or domains on a protein sequence).
.dbxrefs
A list of database cross-references as strings.

So you can get at the Seq object fom inside the record with:

In [4]:
tm.seq
Out[4]:
Seq('GTATTTTTACAACAATTACCAACAACAACAAACAACAAACAACATTACAATTAC...CCA')

Exercise

Have a look at the value of some of these attributes of the tm object. Which are available and which are missing or empty?

Reading in multiple sequences

It is very common to have files which contain multiple sequences. Biopython provides an interface to read these in, one after another which is the SeqIO.parse function. This provides you with an object which you can loop over with a for loop:

In [5]:
for record in SeqIO.parse("ls_orchid_short.fasta", "fasta"):
    print(record.description)
gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA
gi|2765656|emb|Z78531.1|CFZ78531 C.fasciculatum 5.8S rRNA gene and ITS1 and ITS2 DNA
gi|2765655|emb|Z78530.1|CMZ78530 C.margaritaceum 5.8S rRNA gene and ITS1 and ITS2 DNA
gi|2765654|emb|Z78529.1|CLZ78529 C.lichiangense 5.8S rRNA gene and ITS1 and ITS2 DNA

Each time around the loop, you are given a SeqRecord object which will work in exactly the same way as in the previous section, so getting the .description works fine.

Exercise

Load in the ls_orchid_short.fasta file and print out the length of each sequence, along with its description.

The type of the object returned by SeqIO.parse is known as a generator. One of the features of a generator in Python is that you can only loop over them once before they are exhausted. For example you can loop over them once without issue:

In [6]:
records = SeqIO.parse("ls_orchid_short.fasta", "fasta")
print([r.id for r in records])
['gi|2765658|emb|Z78533.1|CIZ78533', 'gi|2765657|emb|Z78532.1|CCZ78532', 'gi|2765656|emb|Z78531.1|CFZ78531', 'gi|2765655|emb|Z78530.1|CMZ78530', 'gi|2765654|emb|Z78529.1|CLZ78529']

But if you try to use the same object again, you will see that you get nothing back:

In [7]:
print([r.id for r in records])
[]

The reason for this is that it allows you to access, one at a time, a very long list of sequences, potentially more than can fit in memory at once since it only loads them one at a time.

If you know you have a short set of sequences (as we have in this tutorial) then you can load them all into a Python list and access them however you wish:

In [8]:
records = list(SeqIO.parse("ls_orchid_short.fasta", "fasta"))
print([r.id for r in records])
['gi|2765658|emb|Z78533.1|CIZ78533', 'gi|2765657|emb|Z78532.1|CCZ78532', 'gi|2765656|emb|Z78531.1|CFZ78531', 'gi|2765655|emb|Z78530.1|CMZ78530', 'gi|2765654|emb|Z78529.1|CLZ78529']
In [9]:
print([r.id for r in records])
['gi|2765658|emb|Z78533.1|CIZ78533', 'gi|2765657|emb|Z78532.1|CCZ78532', 'gi|2765656|emb|Z78531.1|CFZ78531', 'gi|2765655|emb|Z78530.1|CMZ78530', 'gi|2765654|emb|Z78529.1|CLZ78529']

Genbank files

As well as FASTA files, Biopython can read GenBank files. All you need to do is specify the filetype when calling the SeqIO.parse function. If you pass "genbank" (or "gb") as the second argument then it will read it as a GenBankfile:

In [10]:
record_iterator = SeqIO.parse("ls_orchid.gbk", "genbank")

If you are loading a file with multiple sequences in, you can grab just the first one with Python's next function. This gives you whatever would be available the first time around the loop:

In [11]:
first_record = next(record_iterator)

GenBank fiiles usually contain a lot more information than a FASTA so more of the fields of the SeqRecord will be filled in. This means that we can, for example, see the annotations that the sequecne has:

In [12]:
first_record.annotations
Out[12]:
{'molecule_type': 'DNA',
 'topology': 'linear',
 'data_file_division': 'PLN',
 'date': '30-NOV-2006',
 'accessions': ['Z78533'],
 'sequence_version': 1,
 'gi': '2765658',
 'keywords': ['5.8S ribosomal RNA',
  '5.8S rRNA gene',
  'internal transcribed spacer',
  'ITS1',
  'ITS2'],
 'source': 'Cypripedium irapeanum',
 'organism': 'Cypripedium irapeanum',
 'taxonomy': ['Eukaryota',
  'Viridiplantae',
  'Streptophyta',
  'Embryophyta',
  'Tracheophyta',
  'Spermatophyta',
  'Magnoliophyta',
  'Liliopsida',
  'Asparagales',
  'Orchidaceae',
  'Cypripedioideae',
  'Cypripedium'],
 'references': [Reference(title='Phylogenetics of the slipper orchids (Cypripedioideae: Orchidaceae): nuclear rDNA ITS sequences', ...),
  Reference(title='Direct Submission', ...)]}

Exercise

Take a look at the record and see what other SeqRecord attributes are filled in which were missing from the FASTA file we loaded earlier.

Output

We’ve talked about using Bio.SeqIO.parse for sequence input (reading files), and now we’ll look at Bio.SeqIO.write which is for sequence output (writing files). This is a function taking three arguments: some SeqRecord objects, a handle or filename to write to, and a sequence format.

Here is an example, where we start by creating a few SeqRecord objects the hard way (by hand, rather than by loading them from a file):

In [13]:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

rec1 = SeqRecord(
    Seq(
        "MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD"
        "GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK"
        "NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM"
        "SSAC",
    ),
    id="gi|14150838|gb|AAK54648.1|AF376133_1",
    description="chalcone synthase [Cucumis sativus]",
)

rec2 = SeqRecord(
    Seq(
        "YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ"
        "DMVVVEIPKLGKEAAVKAIKEWGQ",
    ),
    id="gi|13919613|gb|AAK33142.1|",
    description="chalcone synthase [Fragaria vesca subsp. bracteata]",
)

my_records = [rec1, rec2]

Now we have a list of SeqRecord objects, we’ll write them to a FASTA format file:

In [14]:
SeqIO.write(my_records, "my_example.faa", "fasta")
Out[14]:
2

The 2 that gets returned tells you how many records were written.

Exercise

Create a SeqRecord by hand and write it to a .fasta file. Then read that file in and check that the details match.